Abstract

In this paper I describe some results on the use of virtual processor
technology to parallelize SPMD computational programs in a cluster
environment. The technology tested is Intel Hyper-Threading on real
processors, and the programs are MATLAB 6.5 Release 13 scripts for
floating-point computation. Using this technology, I found that a
cluster can profitably run a number of concurrent processes equal to
twice the number of physical processors. The conclusions of the work
concern the utility and the limits of this approach. The main result is
that using virtual processors is a good technique for improving
parallel programs, not only for memory-based computations but also for
massive disk-storage operations.



1 Introduction 

Processor virtualization technology makes it possible to split a
physical processor into two virtual chips, so that the operating system
of a computer, such as MS Windows or Linux, can use the virtual
processors as two real chips. An example of such technology is Intel's
Hyper-Threading [1]. The hardware can thus be considered a symmetric
multiprocessor machine, and the software can use it as a truly parallel
environment.

In this work I show some results obtained with parallel computations
using Matlab [2] programs on Intel technology. A previous paper [3]
describes the same cases for an older Matlab version and for a single
dual-processor machine. The physical and logical characteristics of the
cluster used are presented in the following tables:



Hardware

Type              2 nodes HP Compaq ProLiant DL360
Processors        2 Intel Xeon 3.20 GHz for each node
RAM               2 GB for each node
Network           1 Gb switch for node connection
Storage           4 SCSI disks 36.5 GB, RAID 5, for each node

Software

Operating System  MS Windows Server 2000 partition,
                  SuSE Linux 8.1 partition
Matlab            v. 6.5.0 Release 13



The Matlab programs used for these experiments were based on cycles of
floating-point computations.

2 The parallel Matlab environment 

The Matlab package has no native support for parallel processing and
multithreading [4]. However, there are some extensions, in the form of
tools and libraries [5], for using a parallel environment on
multiprocessor hardware. On the cluster I used the method of splitting
a given computation across multiple instances of the Matlab runtime. A
single master instance starts the slave copies on the nodes and assigns
each of them the same set of instructions on a different set of data.
Hence on the cluster I have simulated an SPMD computation.

In this way the parallel environment is simple, because there is no
need for external libraries or calls to interfaces, and flexible,
because a single slave copy can be assigned a different set of
instructions in order to realize an MPMD computation.

With this method the exchange of messages among independent processes
is a problem. The only way to communicate from one Matlab copy to
another is through shared files. In a second type of experiment I show
that this method is not critical for the execution time if one uses
fast mass storage, such as SCSI or Fibre Channel systems, and the nodes
are connected by a fast private LAN.

2.1 The SPMD programs 

In the experiments I defined a master Matlab function which writes to
a shared file system the .m scripts to be executed by the slave Matlab
copies. These copies are launched in background mode for parallel
execution. The master program detects the end of the computations using
a simple set of lock-files. The slaves finish their work, save the
results to files and delete their own lock-file. The master reads the
data sets from these files for possible further computations. I now
describe the main code of the program.

This is the declaration of the function masterf. The lockarray variable
is an array for testing the presence of the lock-files during the
slaves' computation. The finalres variable is an array that collects
the partial results from the slaves. The string computing is the
mathematical expression to use in the computation. The array nodes
contains the names of the cluster's machines and is used for the remote
startup of the Matlab engines.

function [elapsedtime,totaltime,executiontime]=masterf(nproc,maxvalue,step,computing)
%
% MASTERF: master function for parallel background computation.
%
% syntax:
% [elapsedtime,totaltime,executiontime]=masterf(nproc,maxvalue,step,computing)
%
% input parameters:
%
% nproc = number of processes;
% maxvalue = upper limit of the data array to process; the lower limit is 0;
% step = difference between two consecutive numbers in the data array;
% computing = the string of the mathematical expression to compute;
%
% output parameters:
%
% elapsedtime = total elapsed time to complete the execution of the computation;
% totaltime = sum of the single slaves' CPU-times to complete the computation;
% executiontime = single-slave CPU-time to complete the assigned computation;

ostype=computer;
tottime=0.;
lockarray=0:nproc-1;
numbervalues=maxvalue;
computingstring=[' ' computing];
finalres=[];
node01='node01'; node02='node02';   % assumed host names of the two nodes
nodes=[node01 node02];

After the variable workdir, the Matlab working directory, has been
assigned its value, a loop writes the slaves' lock-files to storage.

for i=0:nproc-1
    filelock = strcat(workdir,'filelock',int2str(i));
    fid=fopen(filelock,'w');
    fwrite(fid,'');
    fclose(fid);
end

In the next fragment of the program, the master builds the commands for
writing an appropriate Matlab .m script for every slave process. Each
script contains the instruction for determining the CPU-time spent on
the calculation, the expression of the mathematical computation, the
instruction to save the computed data and the CPU-time to storage, and
finally the instruction to delete the lock-file.

for i=0:nproc-1
    % every range after the first starts one step past the previous upper bound
    if (i==0) middlestep=0; else middlestep=1; end
    infdata=i*(numbervalues/nproc) + middlestep*step;
    supdata=(i+1)*(numbervalues/nproc);
    fileworker = strcat(workdir,'fileworker',int2str(i),'.m');
    commandworkertmp = ...
        strcat('x=',num2str(infdata),':',num2str(step),':',num2str(supdata),...
        '; t1=cputime; ',computingstring,...
        '; t2=cputime-t1; save out',int2str(i));
    commandworker = ['cd ' workdir '; ' commandworkertmp ...
        ' y t2; ' 'delete filelock' int2str(i) '; exit;'];
    fid = fopen(fileworker,'wt');
    fwrite(fid, commandworker);
    fclose(fid);
end
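The partition arithmetic of this loop can be checked in isolation. The
following is a minimal Python sketch, not the Matlab code used in the
experiments; the function name partition_bounds is hypothetical and is
introduced only for illustration.

```python
def partition_bounds(nproc, maxvalue, step):
    """Return one (infdata, supdata) pair per process rank.

    Mirrors the Matlab loop: rank 0 starts at 0, every later rank starts
    one step past the upper bound of the previous rank, so that no value
    is computed twice.
    """
    bounds = []
    for i in range(nproc):
        middlestep = 0 if i == 0 else 1
        infdata = i * (maxvalue / nproc) + middlestep * step
        supdata = (i + 1) * (maxvalue / nproc)
        bounds.append((infdata, supdata))
    return bounds


if __name__ == "__main__":
    # Example: 4 processes over [0, 20000] with step 0.001
    for rank, (lower, upper) in enumerate(partition_bounds(4, 20000, 0.001)):
        print(rank, lower, upper)
```

The sketch only reproduces the bounds; it does not write the worker
scripts or run any computation.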

The following instructions are OS-dependent, and are needed to set
correctly the command for the remote startup of the Matlab engines on
the nodes:

switch ostype
case 'PCWIN'
    osstring = 'dos';
    workdir = strcat(matlabroot,'\work\');
    startcommand = 'rcmd';
case 'LNX86'
    osstring = 'unix';
    workdir = strcat(matlabroot,'/work/');
    startcommand = 'rsh';
end
After the instructions that record the CPU-time and start the
elapsed-time timer (tic) of the master program, a loop launches the
same number of slave Matlab runtimes on each node. On Windows, the
startcommand string is "rcmd", the OS command for running an executable
program in the background on a remote machine, and the osstring string
is "dos". On Unix-like operating systems, the strings are "rsh" and
"unix" respectively. Each slave immediately executes the fileworker
script, as specified by the Matlab "-r" parameter. The basic remote
command is completed with the name of the node, alternating the startup
order for simple load balancing.

t1 = cputime;
tic;
for i=0:nproc-1
    % alternate the nodes; the command is built afresh at each iteration
    % so that node names do not accumulate in startcommand
    if (mod(i,2)==0), nodecommand = [startcommand ' ' node02]; else ...
        nodecommand = [startcommand ' ' node01]; end;
    fileworker = strcat('fileworker',int2str(i));
    commandrun = [nodecommand ' matlab -minimize -r ' fileworker];
    eval(strcat([osstring,'(','''',commandrun,'''',');']));
end
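The alternation of the nodes can be illustrated without launching
anything. The following Python sketch, which is not part of the
original program, only builds the command strings; the function name
build_launch_commands and the node names in the example are
hypothetical.

```python
def build_launch_commands(nproc, nodes, startcommand):
    """Build one remote launch command per slave, alternating the nodes.

    With two nodes, even ranks go to the second node and odd ranks to the
    first, as in the Matlab loop. Each command is assembled from scratch,
    so it always contains exactly one node name.
    """
    commands = []
    for i in range(nproc):
        node = nodes[(i + 1) % len(nodes)]
        commands.append(
            f"{startcommand} {node} matlab -minimize -r fileworker{i}"
        )
    return commands


if __name__ == "__main__":
    for command in build_launch_commands(4, ["node01", "node02"], "rsh"):
        print(command)
```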

In the next fragment of code the master program runs a loop to
determine the end of the slaves' computations. It checks whether the
lockarray variable still contains any non-negative process rank. If so,
it attempts to open the corresponding lock-file; if the file still
exists, the master closes it, otherwise the process's position in
lockarray is set to -1. The pause instruction avoids polling too
frequently, and hence consuming too much CPU-time, in the "while" loop.

lockarraytmp=find(lockarray > -1);
while (length(lockarraytmp) > 0)
    pause(.1);
    for i=lockarraytmp
        % position i in lockarray corresponds to process rank i-1
        fid = fopen(strcat('filelock',int2str(i-1)),'r');
        if (fid < 0)
            lockarray(i) = -1;
        else
            fclose(fid);
        end
    end
    lockarraytmp=find(lockarray > -1);
end
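The same lock-file polling scheme can be sketched in a few lines of
Python. This is an illustration only, not the code used in the
experiments: the function name wait_for_locks is hypothetical, and the
timeout parameter is an addition that the Matlab loop above does not
have.

```python
import os
import time


def wait_for_locks(workdir, nproc, poll_interval=0.1, timeout=None):
    """Wait until every file filelock<i> (i = 0..nproc-1) has disappeared.

    Returns True when all lock-files are gone, or False if the optional
    timeout (in seconds) expires first.
    """
    pending = set(range(nproc))
    start = time.monotonic()
    while pending:
        # keep only the ranks whose lock-file still exists
        pending = {
            i for i in pending
            if os.path.exists(os.path.join(workdir, f"filelock{i}"))
        }
        if not pending:
            break
        if timeout is not None and time.monotonic() - start > timeout:
            return False
        # sleep between polls to avoid wasting CPU-time
        time.sleep(poll_interval)
    return True
```

As in the Matlab version, a slave signals completion simply by deleting
its own lock-file, so no library other than the file system is needed.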

At the end, the master reads the slaves' partial outputs and stores
them in an array. At this point the master's CPU-time and elapsed time
are recorded too. The total execution time is defined as the sum of the
slaves' computation CPU-times, and is useful for comparison with the
execution time in the case nproc = 1. The single-slave execution time
is defined as the arithmetic mean of all the partial execution times.

for i=0:nproc-1
    partialres = load(strcat('out',int2str(i)));
    finalres = [finalres partialres];
end

elapsedtime = toc;
totaltime = cputime - t1;

for i=0:nproc-1
    % finalres(i+1) holds the output of the slave with rank i
    tottime = tottime + finalres(i+1).t2;
end
executiontime = tottime/nproc;



3 Tests and results 

For the tests I have used the following values for the masterf parameters: 

nproc: from 2 to 16, step=2 (even numbers only, for correct balancing
of the load on the nodes);
maxvalue: m * 10000, where m = 2, 4, 6;
step: 0.001;

computing: y = 5432.060708 * cos((sin(x^9 - 876)) - 1.2345).

I have also tested the program without the slaves saving their partial
results and the master loading them at the end, in order to determine
the influence of storage I/O operations on the execution times.

In the following table, the values are expressed in seconds. The
numbers 2, ..., 16 are the values of the nproc parameter. I have not
reported the elapsed times, because they did not differ from the
recorded CPU-times, probably because the cluster was dedicated
exclusively to the computations during the experiments.

When the results are not written to and read from storage, the times
are 20%-30% lower. The time values are those of the MS Windows case; in
the Linux case the recorded times are in general 15%-20% higher. This
is probably due to a non-optimized installation of the Linux
distribution on the nodes.

Table 1. Total execution CPU-times, with data storage:



m \ nproc       2       4       6       8      10      12      14      16

2           48.29   27.70   32.51   22.56   28.14   31.34   33.28   35.04
4          126.53   65.21   74.79   54.27   63.17   74.29   83.01   91.34
6          263.37  109.48  121.30   78.41  116.23  125.69  138.51  145.93



4 Analysis of results 

From the results of the previous section, I deduce the following
observations:

1. The case nproc=8, that is, the number of possible Hyper-Threading
virtual processors based on the four physical chips, gives the best
performance for all the values of the m parameter;

2. The case nproc=4, the number of physical processors in the cluster,
has a local performance peak for all the values of the m parameter;

3. The speedup [6] seems to improve for increasing values of the
parameter m, hence for larger amounts of data to be computed; in the
case m=2 the speedup of 8 running processes over the case of 2
processes is about 2.14, while in the case m=6 the same speedup is
about 3.35 (quasi-linear speedup).

In Fig. 1 the graphs are interpolations of the Table 1 data. The
performance peaks at nproc=4 and nproc=8 are clearly visible,
especially in the case m=6.
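The speedup figures in item 3 can be recomputed directly from Table 1.
The following Python sketch is only a convenience for checking the
arithmetic; the helper names speedup and best_nproc are hypothetical,
and the speedup is taken relative to the nproc=2 run, as in the text.

```python
# Total CPU-times in seconds, copied from Table 1
# (keys: m, list positions: nproc = 2, 4, ..., 16).
NPROC = [2, 4, 6, 8, 10, 12, 14, 16]
TIMES = {
    2: [48.29, 27.70, 32.51, 22.56, 28.14, 31.34, 33.28, 35.04],
    4: [126.53, 65.21, 74.79, 54.27, 63.17, 74.29, 83.01, 91.34],
    6: [263.37, 109.48, 121.30, 78.41, 116.23, 125.69, 138.51, 145.93],
}


def speedup(m, nproc, baseline=2):
    """Ratio of the baseline time to the time measured with nproc processes."""
    row = TIMES[m]
    return row[NPROC.index(baseline)] / row[NPROC.index(nproc)]


def best_nproc(m):
    """Number of processes giving the lowest time for a given m."""
    row = TIMES[m]
    return NPROC[row.index(min(row))]


if __name__ == "__main__":
    for m in sorted(TIMES):
        print(m, best_nproc(m), round(speedup(m, 8), 2))
```

Running the script prints, for each m, the best value of nproc and the
speedup of 8 processes over 2, which can be compared with the figures
quoted in item 3.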



Figure 1: Graphs of Table 1 data



4.1 Conclusions

From the previous facts one can deduce that a virtual processor
technology such as Hyper-Threading in a cluster environment can be a
good choice for running SPMD programs when

• the number of parallel processes is equal to the number of virtual
processors;

• the amount of data to be computed is large, particularly when its
distribution among the processes and the merging of the final results
are based on files stored on a fast storage system.